Goto

Collaborating Authors

 qe metric


Quality Estimation Reranking for Document-Level Translation

Mrozinski, Krzysztof, Kang, Minji, Khota, Ahmed, Sutanto, Vincent Michael, De Giacomo, Giovanni Gatti

arXiv.org Artificial Intelligence

Quality estimation (QE) reranking is a form of quality-aware decoding which aims to improve machine translation (MT) by scoring and selecting the best candidate from a pool of generated translations. While known to be effective at the sentence level, its application to the increasingly prominent domain of document-level translation remains underexplored. In this work, we evaluate QE reranking performance on document-level (rather than the typical sentence-level) translation, using various learned and large language model (LLM)-based QE metrics. We find that with our best learned metric, SLIDE, BLEURT-20 scores improve by +2.00 with only two candidates, and by +5.09 with 32, across both decoder-only LLM models and encoder-decoder neural machine translation (NMT) models. Using the best LLM-based metric, GEMBA-DA, gains of +1.63 and +4.30 are achieved under the same conditions. Although gains shrink with longer inputs, reranking with 32 candidates yields improvements of +2.34 (SLIDE) and +1.40 (GEMBA-DA) on our longest documents (512-1024 source tokens). These findings demonstrate the practical value of document-level QE, with minimal runtime overhead given suitable translation models and hardware.


GAMBIT+: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation Metrics

Filandrianos, Giorgos, Mastromichalakis, Orfeas Menis, Mohammed, Wafaa, Attanasio, Giuseppe, Zerva, Chrysoula

arXiv.org Artificial Intelligence

Gender bias in machine translation (MT) systems has been extensively documented, but bias in automatic quality estimation (QE) metrics remains comparatively underexplored. Existing studies suggest that QE metrics can also exhibit gender bias, yet most analyses are limited by small datasets, narrow occupational coverage, and restricted language variety. To address this gap, we introduce a large-scale challenge set specifically designed to probe the behavior of QE metrics when evaluating translations containing gender-ambiguous occupational terms. Building on the GAMBIT corpus of English texts with gender-ambiguous occupations, we extend coverage to three source languages that are genderless or natural-gendered, and eleven target languages with grammatical gender, resulting in 33 source-target language pairs. Each source text is paired with two target versions differing only in the grammatical gender of the occupational term(s) (masculine vs. feminine), with all dependent grammatical elements adjusted accordingly. An unbiased QE metric should assign equal or near-equal scores to both versions. The dataset's scale, breadth, and fully parallel design, where the same set of texts is aligned across all languages, enables fine-grained bias analysis by occupation and systematic comparisons across languages.


Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation

Zaranis, Emmanouil, Attanasio, Giuseppe, Agrawal, Sweta, Martins, André F. T.

arXiv.org Artificial Intelligence

The automatic assessment of translation quality has recently become crucial across several stages of the translation pipeline, from data curation to training and decoding. Although quality estimation (QE) metrics have been optimized to align with human judgments, no attention has been given to these metrics' potential biases, particularly in reinforcing visibility and usability for some demographic groups over others. This study is the first to investigate gender bias in QE metrics and its downstream impact on machine translation (MT). Focusing on out-of-English translations into languages with grammatical gender, we ask: Do contemporary QE metrics exhibit gender bias? Can the use of contextual information mitigate this bias? How does QE influence gender bias in MT outputs? Experiments with state-of-the-art QE metrics across multiple domains, datasets, and languages reveal significant bias. Masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. Moreover, context-aware QE metrics reduce errors for masculine-inflected references but fail to address feminine referents, exacerbating gender disparities. Additionally, QE metrics can perpetuate gender bias in MT systems when used in quality-aware decoding. Our findings underscore the need to address gender bias in QE metrics to ensure equitable and unbiased MT systems.


Quality Estimation with $k$-nearest Neighbors and Automatic Evaluation for Model-specific Quality Estimation

Dinh, Tu Anh, Palzer, Tobias, Niehues, Jan

arXiv.org Artificial Intelligence

Providing quality scores along with Machine Translation (MT) output, so-called reference-free Quality Estimation (QE), is crucial to inform users about the reliability of the translation. We propose a model-specific, unsupervised QE approach, termed $k$NN-QE, that extracts information from the MT model's training data using $k$-nearest neighbors. Measuring the performance of model-specific QE is not straightforward, since they provide quality scores on their own MT output, thus cannot be evaluated using benchmark QE test sets containing human quality scores on premade MT output. Therefore, we propose an automatic evaluation method that uses quality scores from reference-based metrics as gold standard instead of human-generated ones. We are the first to conduct detailed analyses and conclude that this automatic method is sufficient, and the reference-based MetricX-23 is best for the task.


There's no Data Like Better Data: Using QE Metrics for MT Data Filtering

Peter, Jan-Thorsten, Vilar, David, Deutsch, Daniel, Finkelstein, Mara, Juraska, Juraj, Freitag, Markus

arXiv.org Artificial Intelligence

Quality Estimation (QE), the evaluation of machine translation output without the need of explicit references, has seen big improvements in the last years with the use of neural metrics. In this paper we analyze the viability of using QE metrics for filtering out bad quality sentence pairs in the training data of neural machine translation systems~(NMT). While most corpus filtering methods are focused on detecting noisy examples in collections of texts, usually huge amounts of web crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between both approaches.